Let's read the numbers in Italy:
The report Cancer numbers in Italy 2021 confirms that breast cancer is the most frequently diagnosed neoplasm in women: about one in three female malignancies (30%) is a breast cancer.
According to the same report, an estimated 55,000 new diagnoses of female breast cancer occurred in Italy in 2020, and 12,500 deaths are estimated for 2021. Net survival five years after diagnosis is 88%.
According to ISTAT data, in 2018 breast cancer represented, with 13,076 deaths, the leading cause of death from cancer in women.
Since the end of the 1990s, breast cancer mortality has shown a steady decline (-0.8% per year), attributable to the wider adoption of early-diagnosis programs, and thus earlier detection, as well as to therapeutic progress.
Humans have known about breast cancer for a long time. For example, the Edwin Smith surgical papyrus, a medical text dating back to 3000-2500 BC, describes cases of breast cancer. Hippocrates described the stages of breast cancer around 400 BC. In the first century, doctors experimented with surgical incisions to destroy tumors.
Our modern approach to breast cancer treatment and research began to take shape in the 19th century. We consider these milestones:
1882: William Halsted performs the first radical mastectomy. This surgery will remain the standard operation for breast cancer treatment well into the 20th century.
1895: The first X-ray is taken. Eventually, low-dose X-rays called mammograms will be used to detect breast cancer.
1937: Radiotherapy is used in addition to surgery to spare the breast. After the tumor is removed, radium needles are inserted into the breast and near the lymph nodes.
2013: The four major subtypes of breast cancer are defined as HR+/HER2- ("luminal A"), HR-/HER2- ("triple negative"), HR+/HER2+ ("luminal B") and HR-/HER2+ ("HER2-enriched").
2018: A clinical study suggests post-surgery chemotherapy does not benefit 70% of women with early-stage breast cancer.
Breast cancer treatment is becoming more personalized as doctors learn more about the disease. It is now seen as a disease with subtypes that have different patterns and ways of acting on the body. The ability to isolate specific genes and classify breast cancer is the start of more personalized treatment options. Special tests can also tell doctors more about breast cancer.
In today's world, especially among young colleagues at university, reaching a performance close to 99% has become a challenge and often the only goal. Putting an excellent model with good predictive skills into production is of course the aim of any data scientist, but too often we forget that data science is only the means and not the end. If we really want this science to accompany us in everyday life, we must see it simply as a tool for solving the problems placed before us, and we must understand that studying and analyzing those problems matters far more than the technique used to solve them.
For this reason I have chosen a problem from the healthcare sector, the sector that best shows how useful these techniques are in daily life and, above all, that improving the predictive capacity of a model can, in this case, mean saving a human life.
Data Science is not R. Data Science is not Python. Data Science is not SQL. Data Science is not Spark. Data Science is not TensorFlow.
Data Science is using the above tools and techniques, and if required inventing new ones, to solve a problem using "data" in a "scientific" way.
The key challenge of this study is the classification of breast tumors as malignant (cancerous) or benign (non-cancerous) on the Breast Cancer Wisconsin (Diagnostic) dataset, which contains only quantitative features. Among the available features, id is the patient's identification number; we drop it because it is not useful for the analysis. The target variable we want to predict is diagnosis, which refers to the state of the tumor. We treat it as a two-level factor: 1 if it takes the value M (malignant) and 0 if it takes the value B (benign).
Rows: 569
Columns: 32
$ id <int> 842302, 842517, 84300903, 84348301, 84358402, ~
$ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "~
$ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450~
$ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9~
$ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, ~
$ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, ~
$ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0~
$ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0~
$ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0~
$ concave.points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0~
$ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087~
$ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0~
$ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345~
$ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902~
$ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18~
$ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.~
$ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114~
$ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246~
$ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0~
$ concave.points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188~
$ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0~
$ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051~
$ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8~
$ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6~
$ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,~
$ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, ~
$ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791~
$ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249~
$ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0~
$ concave.points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0~
$ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985~
$ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0~
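A minimal R sketch of the preprocessing just described (dropping id, recoding diagnosis as a factor); the small data frame below is an illustrative stand-in for the real dataset, which would typically be loaded with read.csv:

```r
# Illustrative stand-in for the Breast Cancer Wisconsin data: same column
# types, only three rows (assumption: the real file would be read with
# read.csv).
wdbc <- data.frame(
  id          = c(842302, 842517, 84300903),
  diagnosis   = c("M", "M", "B"),
  radius_mean = c(17.99, 20.57, 19.69)
)

# Drop the identifier: it carries no information useful for the analysis
wdbc$id <- NULL

# Recode the target as a two-level factor: 1 = malignant (M), 0 = benign (B)
wdbc$diagnosis <- factor(ifelse(wdbc$diagnosis == "M", 1, 0), levels = c(0, 1))

table(wdbc$diagnosis)
```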
The diagnosis of breast tumors has traditionally been performed by a full biopsy, an invasive surgical procedure. Fine needle aspiration (FNA) is a type of biopsy that consists of inserting a thin needle into an area of abnormal-appearing tissue or body fluid. The aspirated material is then expressed onto a glass slide and stained. The image for digital analysis is generated by a color video camera mounted atop an Olympus microscope: the image is projected into the camera and then captured.
The first step in successfully analyzing the digital image is to specify an accurate location of each cell nucleus boundary. The actual boundary of the cell nucleus is located by an active contour model known in the literature as a Snake. This is a feature extraction technique which initially identifies the approximate Region Of Interest (ROI), by removing the non-tumor part, and then minimizes an energy function defined over the arclength of a closed curve. The energy function is defined in such a way that the minimum value occurs when the curve accurately corresponds to the boundary of a cell nucleus.
Here, we can see a practical example of lesion segmentation on a breast MRI (Magnetic Resonance Imaging) scan:
In the computerized segmentation section, FCMs clustering based method is used to produce an initial segmentation of the input image, while the gradient vector flow (GVF) snake model is applied to the initial segmentation to obtain the final segmentation. The initial segmentation method is referred to as the FCMs-based and the final segmentation method is referred to as the GVF-FCMs for short. The segmentation performance of both methods is evaluated with manual segmentation by experienced radiologists on dynamic contrast-enhanced (DCE) MRI.
This approach is powerful, with a low average time cost and a low dynamic memory cost.
A set of 569 images was processed in the manner described above, yielding a dataset with 569 observations.
The computer vision diagnostic system extracts ten different features from the snake-generated cell nuclei boundaries. All of the features are numerically modeled such that larger values will typically indicate a higher likelihood of malignancy.
The extracted features are: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension.
As we can see above, the number of features grows because three statistics are computed for each of the 10 starting variables: the mean, the standard error (se) and the "worst" value (the mean of the three largest values). Together with id and diagnosis, the final dataset contains 32 columns.
We analyze the features and try to understand which features have larger predictive value considering the degree of separation of the two classes, on each feature. We don't have a perfect separation with any feature, but there are different good features like Concave Points Worst, Concavity Worst, Perimeter Worst, Radius Worst, Concavity Mean, Concave Points Mean, Area Mean, Perimeter Mean.
There are also features whose values almost perfectly overlap for the two classes, such as Symmetry Se, Smoothness Se, Symmetry Mean and Texture Se.
Let's look at these features with scatterplots conditioned on the status of the cancer.
At this step, we compute the correlation matrix.
The correlation is measured with the Pearson formula: \[ \rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \cdot \sigma_Y} \] where \(Cov(X,Y)\) is the covariance between \(X\) and \(Y\), and \(\sigma_X\), \(\sigma_Y\) are their standard deviations. Among the most strongly correlated pairs we find:
Perimeter Mean and Radius Worst.
Area Worst and Radius Worst.
Perimeter Worst and Radius Worst, Perimeter Mean, Area Worst, Area Mean, Radius Mean.
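Pairs like these can be recovered programmatically: a sketch that builds the Pearson correlation matrix with cor() and lists the pairs above a threshold. The toy columns are illustrative stand-ins (a radius-like measurement and a geometrically linked perimeter-like one).

```r
set.seed(1)
radius    <- rnorm(50, mean = 14, sd = 3)
perimeter <- 2 * pi * radius + rnorm(50, sd = 2)  # circumference grows linearly with radius
texture   <- rnorm(50, mean = 19, sd = 4)         # unrelated measurement
X <- data.frame(radius, perimeter, texture)

# Pearson correlation from the definition, checked against cor()
rho_manual <- cov(radius, perimeter) / (sd(radius) * sd(perimeter))

cm   <- cor(X)   # Pearson by default
high <- which(abs(cm) >= 0.7 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           rho  = round(cm[high], 3))
```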
Now, we can see the plots for some of these highly correlated features.
Positive Correlated Features
Uncorrelated Features
Negative Correlated Features
We have too many features, which is a problem for visualization but also for the interpretability of the model. Principal component analysis (PCA) computes the principal components and uses them to perform a change of basis on the data, keeping only the first few components, which together account for a chosen share of the variance. In this project we use this tool only for visualizing the features; in particular, we want to show the two groups of patients in the transformed space of the first two principal components.
We have a vector \(x_i \in \mathbb{R^p}\), and the goal is to find a way to map data such that: \[ x_i \rightarrow z_i \in \mathbb{R^q}, \,\,\, with \,\,\, q < p \] making sure that:
the relevant information about the distribution of \(X\) is preserved.
Goal:
transform the original variables \(X = (X_1, X_2, ..., X_p)'\), into \(Z = (Z_1, Z_2, ..., Z_p)'\).
the transformation is linear: \[ Z_j = \phi_{1j}X_1 + \phi_{2j}X_2 + ... + \phi_{pj}X_p \] \(Z_j\) is called the \(j^{th}\) principal component. The coefficients \[ \phi_{j} = (\phi_{1j}, \phi_{2j}, ..., \phi_{pj})' \] are called the loadings of the \(j^{th}\) principal component.
The principal components are:
sorted by decreasing variance: \(Var[Z_1] \ge Var[Z_2] \ge ... \ge Var[Z_p]\)
uncorrelated: \(Cov[Z_l, Z_m] = 0 \,\,\, \forall \,\,\, l \neq m\).
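These properties can be sketched with base R's prcomp, on a small synthetic matrix standing in for the standardized feature matrix:

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.1)   # make two columns correlated

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component (scree-plot values)
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve, 3)

# Loadings phi of the first principal component
pca$rotation[, 1]

# Scores z_i of the units in the new space (what the biplot displays)
head(pca$x[, 1:2])
```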
Now, we want to analyze the behavior of the data transformed with the PCA in the new dimension. We are interested in understanding the relationship between the original variables and the principal components and choosing the appropriate number of components such as to accumulate a certain variance value.
The biplot is a visualization that allows us to see the original units in the space of the principal components and to overlay the directions of the original \(X\) variables in the space of the new features \(Z\).
Looking at the variability that each component represents, the first component accounts for \(44.3\, \%\) of the variability, while the second accounts for \(19 \, \%\). The vectors show the direction in which the original variables move in the \(Z\) domain. When \(Z_1\) and \(Z_2\) decrease, malignant cancer diagnoses increase.
In particular, we can identify two groups of variables:
the first is located in the upper left and is strongly correlated negatively with the first component and positively with the second component.
the second group of variables is negatively correlated with both the first and the second component.
In order to choose the number of principal components, we can look at the scree plot, which represents the amount of variability explained by each principal component with respect to the total variability contained in the data.
Now, we want to analyze the contribution of the original variables with respect to the first four components.
As we can see, the variables that explain most of the information along the first principal component are Concave Points Mean, Concavity Mean, Concave Points Worst, Compactness Mean and so on. This is consistent with what was said for the biplot: the variables most correlated with the principal component under examination are also those with the greater contribution, i.e. those with a longer direction vector.
The main goal is a full Bayesian analysis aimed at predicting whether a tumor is malignant or benign. To evaluate the predictive capacity of the model, we split the dataset into training (80%) and test (20%) sets. The createDataPartition command of the caret package was used to maintain the same balance of the reference classes in both the training and the test set.
From the correlation analysis, we noticed many highly correlated features, which can lead to multicollinearity, i.e. a specified model in which the predictors are correlated with each other. To address this, we consider all pairs of variables and drop one feature from each pair with absolute correlation \(|\rho_{XY}| \ge 0.7\). To do this we used the findCorrelation function of the caret package.
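The document relies on caret::createDataPartition and caret::findCorrelation; the base-R sketch below mimics both steps on toy data, with a simplified greedy filter in the spirit of findCorrelation (the real function drops the variable with the larger mean absolute correlation in each flagged pair):

```r
set.seed(42)

## Stratified 80/20 split (the balance createDataPartition preserves)
y <- factor(rep(c(0, 1), times = c(60, 40)))   # toy labels, 60% benign
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, floor(0.8 * length(i)))))
train_y <- y[idx]
test_y  <- y[-idx]
mean(train_y == "1")   # class balance preserved: 0.4, as in the full vector

## Simplified greedy drop of one feature per pair with |rho| >= 0.7
X <- data.frame(a = rnorm(100))
X$b <- X$a + rnorm(100, sd = 0.1)   # |cor(a, b)| well above 0.7
X$c <- rnorm(100)                   # roughly uncorrelated with the rest
cm <- abs(cor(X))
drop <- character(0)
for (j in seq_len(ncol(cm))) {
  if (colnames(cm)[j] %in% drop) next
  partners <- which(cm[, j] >= 0.7 & seq_len(nrow(cm)) != j)
  drop <- union(drop, rownames(cm)[partners])
}
kept <- setdiff(colnames(X), drop)
kept   # "b" is dropped; "a" and "c" survive
```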
As we can see, at the end of this procedure, we pass from 30 predictors to 10 predictors:
Rows: 569
Columns: 11
$ diagnosis <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
$ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9~
$ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, ~
$ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087~
$ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902~
$ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114~
$ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0~
$ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051~
$ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791~
$ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985~
$ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0~
In this problem, the target variable diagnosis can be modeled as a Bernoulli distribution: \[ Y_i \in \{0,1\}, \; \text{where} \; 1 \; \text{is malignant (cancerous) and} \; 0 \; \text{is benign (non-cancerous)} \] We consider the following logistic regression model with the logit link function, the mathematical expression that connects the linear predictor with the response \(Y\). \[ Y_i \sim Bernoulli(p_i) \] \[ logit(p_i) = log\Big(\frac{p_i}{1-p_i}\Big) = \beta_1 \ Texture \ Mean \ + \beta_2 \ Area\ Mean \ + \beta_3 \ Symmetry\ Mean \ + \beta_4 \ Texture \ Se + \beta_5 \ Smoothness\ Se + \\ + \beta_6 \ Symmetry \ Se + \beta_7 \ Fractal \ Dimension \ Se + \beta_8 \ Smoothness \ Worst + \beta_9 \ Symmetry \ Worst + \beta_{10} \ Fractal \ Dimension \ Worst \]
One of the assumptions of logistic regression is that the features are independent of each other.
Model coefficients \(\beta_i\) are given Normal priors because they can take any real value. We use a vague prior centered at zero: \[ \beta_i \sim N\big(\mu = 0, \sigma^2 = 10^6\big) \] Note that JAGS parameterizes the Normal with the precision \(\tau = 1/\sigma^2 = 10^{-6}\), which is the second argument of dnorm in the code below. For the specification of the model, we don't consider the intercept.
As we can see from the following plot, the balance of the response variable is respected after the division into train set and test set, to avoid introducing bias.
We implemented the model using R2jags, as follows:
## Model 1 (Complete)
model <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta1*x1[i] + beta2*x2[i] +
      beta3*x3[i] + beta4*x4[i] + beta5*x5[i] +
      beta6*x6[i] + beta7*x7[i] + beta8*x8[i] +
      beta9*x9[i] + beta10*x10[i]
  }
  ## Defining the prior beta parameters
  beta1 ~ dnorm(0, 1.0E-6)
  beta2 ~ dnorm(0, 1.0E-6)
  beta3 ~ dnorm(0, 1.0E-6)
  beta4 ~ dnorm(0, 1.0E-6)
  beta5 ~ dnorm(0, 1.0E-6)
  beta6 ~ dnorm(0, 1.0E-6)
  beta7 ~ dnorm(0, 1.0E-6)
  beta8 ~ dnorm(0, 1.0E-6)
  beta9 ~ dnorm(0, 1.0E-6)
  beta10 ~ dnorm(0, 1.0E-6)
}
## Passing the data for RJags
data.jags <- list("y" = y, "N" = N, "x1" = x1, "x2" = x2,
"x3" = x3, "x4" = x4, "x5" = x5,
"x6" = x6, "x7" = x7, "x8" = x8,
"x9" = x9, "x10" = x10)
## Defining parameters of interest to show after running RJags
mod.params <- c("beta1", "beta2", "beta3", "beta4",
"beta5", "beta6", "beta7", "beta8",
"beta9", "beta10")
## Bayesian Approach
## Run JAGS
## Model 1
mod_jags <- jags(data = data.jags,
n.iter = 10000,
model.file = model,
n.chains = 3,
parameters.to.save = mod.params,
n.burnin = 1000, n.thin = 10)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 456
## Unobserved stochastic nodes: 10
## Total graph size: 10063
##
## Initializing model
Running JAGS with 3 chains, 10,000 iterations, a burn-in of 1,000 steps and a thinning of 10, we obtain:
| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta1 | -0.0737 | 0.0445 | -0.1639 | -0.1027 | -0.0735 | -0.0436 | 0.0103 | 1.0015 | 1800 |
| beta10 | -31.5038 | 17.0101 | -64.8325 | -42.7513 | -31.8779 | -20.1080 | 0.2389 | 1.0011 | 2700 |
| beta2 | 0.0092 | 0.0011 | 0.0073 | 0.0085 | 0.0092 | 0.0099 | 0.0113 | 1.0025 | 970 |
| beta3 | -80.1357 | 10.7527 | -101.4617 | -87.2797 | -80.0974 | -72.7272 | -60.4596 | 1.0036 | 630 |
| beta4 | 0.4216 | 0.4274 | -0.4200 | 0.1360 | 0.4209 | 0.7062 | 1.2403 | 1.0018 | 1400 |
| beta5 | 41.1128 | 113.6346 | -183.2997 | -35.5522 | 40.2877 | 119.2611 | 266.5889 | 1.0006 | 2700 |
| beta6 | -103.5617 | 34.4233 | -170.3371 | -126.6671 | -103.5801 | -79.5467 | -36.9862 | 1.0008 | 2700 |
| beta7 | 379.2194 | 102.8712 | 173.1646 | 313.1684 | 384.8905 | 444.0248 | 576.8855 | 1.0011 | 2700 |
| beta8 | 10.6834 | 12.9146 | -14.2873 | 2.2903 | 10.8980 | 19.2515 | 36.4702 | 1.0010 | 2700 |
| beta9 | 36.4856 | 6.4848 | 24.0616 | 32.1903 | 36.3748 | 40.6673 | 49.8435 | 1.0013 | 2200 |
| deviance | 325.8906 | 6.4609 | 318.8481 | 322.3587 | 325.0252 | 328.4061 | 336.5689 | 1.0005 | 2700 |
We know that \(\beta_j\) measures the marginal impact of the covariate \(X_j\) on the log-odds in favor of \(Y = 1\). So \((e^{\beta_j} - 1) \times 100\) measures the percentage change rate of the odds in favor of \(Y = 1\) when the predictor \(X_j\) varies by one unit.
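For example, using the posterior mean of beta2 (the Area Mean coefficient) reported in the table above:

```r
# Percentage change in the odds of malignancy for a one-unit increase in
# Area Mean, using the posterior mean beta2 = 0.0092 from the table.
beta2 <- 0.0092
pct_change <- (exp(beta2) - 1) * 100
round(pct_change, 3)   # about 0.924: each extra unit of area raises the odds by ~0.92%
```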
We have other interesting results:
Mean: point estimate for \(\beta_j\).
Sd: standard deviation for \(\beta_j\).
Rhat: the potential scale reduction factor, a measure of the convergence of the chains; it compares the between-chains variance with the within-chains variance, and if the two are similar the chains likely come from the same distribution. In our case the value of Rhat is always very close to 1.
n.eff: the effective sample size, i.e. the number of independent Monte Carlo samples that would give the same precision as the MCMC samples. The greater this value, the lower the autocorrelation between MCMC steps and the better the final approximation, because the draws behave more like independent samples.
DIC: the Deviance Information Criterion, a generalization of the AIC and a measure of goodness of fit. Lower values are better. These criteria have no absolute scale and should only be used to rank models.
We can also compute credible intervals for the parameters.
Equi-tail intervals:
| | 2.5% | 97.5% |
|---|---|---|
| beta1 | -0.1639294 | 0.0102767 |
| beta10 | -64.8324810 | 0.2388648 |
| beta2 | 0.0073378 | 0.0113394 |
| beta3 | -101.4616525 | -60.4595676 |
| beta4 | -0.4199928 | 1.2403322 |
| beta5 | -183.2997351 | 266.5889149 |
| beta6 | -170.3371420 | -36.9862169 |
| beta7 | 173.1646178 | 576.8854976 |
| beta8 | -14.2872721 | 36.4702192 |
| beta9 | 24.0615947 | 49.8435244 |
HPD intervals:
| | lower | upper |
|---|---|---|
| beta1 | -0.1580455 | 0.0142097 |
| beta10 | -64.8503339 | 0.2657437 |
| beta2 | 0.0073160 | 0.0112938 |
| beta3 | -100.9083440 | -60.0598783 |
| beta4 | -0.3596252 | 1.2854830 |
| beta5 | -189.1564968 | 254.8279152 |
| beta6 | -168.3599686 | -35.7949675 |
| beta7 | 172.8495948 | 576.4033554 |
| beta8 | -14.3477747 | 36.4583913 |
| beta9 | 24.2406558 | 49.8828444 |
For comparison, the corresponding frequentist maximum likelihood estimates are:

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| x1 | -0.0706 | 0.0433 | -1.6299 | 0.1031 |
| x2 | 0.0087 | 0.0010 | 8.6328 | 0.0000 |
| x3 | -76.4210 | 10.3103 | -7.4121 | 0.0000 |
| x4 | 0.4263 | 0.4149 | 1.0277 | 0.3041 |
| x5 | 34.7422 | 115.2116 | 0.3016 | 0.7630 |
| x6 | -100.1797 | 33.9490 | -2.9509 | 0.0032 |
| x7 | 371.9737 | 92.0941 | 4.0391 | 0.0001 |
| x8 | 10.8326 | 12.6072 | 0.8592 | 0.3902 |
| x9 | 35.0198 | 6.3987 | 5.4730 | 0.0000 |
| x10 | -31.4309 | 15.9309 | -1.9729 | 0.0485 |
We now try to eliminate some features (x4, x5, x8 and x10) and see what happens. The model is defined using R2jags, as follows:
## Model 2 (Reduced)
model2 <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta1*x1[i] + beta2*x2[i] +
      beta3*x3[i] + beta6*x6[i] + beta7*x7[i] +
      beta9*x9[i]
  }
  ## Defining the prior beta parameters
  beta1 ~ dnorm(0, 1.0E-6)
  beta2 ~ dnorm(0, 1.0E-6)
  beta3 ~ dnorm(0, 1.0E-6)
  beta6 ~ dnorm(0, 1.0E-6)
  beta7 ~ dnorm(0, 1.0E-6)
  beta9 ~ dnorm(0, 1.0E-6)
}
## Passing the data for RJags
data.jags2 <- list("y" = y, "N" = N, "x1" = x1,
"x2" = x2, "x3" = x3,
"x6" = x6, "x7" = x7, "x9" = x9)
## Defining parameters of interest to show after running RJags
mod.params2 <- c("beta1", "beta2", "beta3",
"beta6", "beta7", "beta9")
## Bayesian Approach
## Run JAGS
## Model 2
mod_jags2 <- jags(data = data.jags2,
n.iter = 10000,
model.file = model2,
n.chains = 3,
parameters.to.save = mod.params2,
n.burnin = 1000, n.thin = 10)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 456
## Unobserved stochastic nodes: 6
## Total graph size: 6583
##
## Initializing model
Running JAGS with 3 chains, 10,000 iterations, a burn-in of 1,000 steps and a thinning of 10, we obtain:
| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta1 | -0.0540 | 0.0314 | -0.1167 | -0.0748 | -0.0537 | -0.0327 | 0.0054 | 1.0032 | 730 |
| beta2 | 0.0087 | 0.0010 | 0.0069 | 0.0081 | 0.0087 | 0.0093 | 0.0107 | 1.0036 | 640 |
| beta3 | -69.6686 | 8.9123 | -86.8519 | -75.4442 | -69.7594 | -63.7351 | -52.8326 | 1.0050 | 440 |
| beta6 | -62.5022 | 23.2974 | -107.6673 | -78.6265 | -62.3917 | -46.9562 | -16.4377 | 1.0006 | 2700 |
| beta7 | 244.4337 | 65.2123 | 112.2856 | 200.6755 | 246.8055 | 288.9248 | 367.5216 | 1.0010 | 2700 |
| beta9 | 27.0470 | 4.2223 | 19.1164 | 24.2326 | 27.0059 | 29.8305 | 35.5377 | 1.0038 | 600 |
| deviance | 327.7616 | 5.3962 | 322.8133 | 324.9668 | 326.8906 | 329.6372 | 336.2201 | 1.0014 | 2700 |
For comparison, the corresponding frequentist maximum likelihood estimates are:

| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| x1 | -0.0531 | 0.0317 | -1.6759 | 0.0938 |
| x2 | 0.0085 | 0.0009 | 9.2681 | 0.0000 |
| x3 | -67.9006 | 8.5548 | -7.9371 | 0.0000 |
| x6 | -61.8177 | 22.4817 | -2.7497 | 0.0060 |
| x7 | 247.0764 | 64.7764 | 3.8143 | 0.0001 |
| x9 | 26.2832 | 4.0435 | 6.5000 | 0.0000 |
At this point, we take the reduced model as the best one and proceed. For the diagnostics, we can look at the trace plots of the Markov chains, the density of the values along each chain and the autocorrelation of each parameter.
Looking at the trace plot, it is possible to observe the trend of the parameter with respect to the iteration of one or more Markov chains. We would like to observe a stationary time series, which would mean that the trend of the parameter, on average, is almost constant in the long run (flat).
We can see that the Markov chains remain in a steady state and the autocorrelation between consecutive values of the chain is very small. The autocorrelation drops to around zero after a few lags, meaning there is little dependence among different values in the chain. This happens for all the parameters, although for \(\beta_1\) and \(\beta_2\) the autocorrelation decreases more slowly.
Here, we want to compute the running (ergodic) average \(\mathbf{\hat{I}}_{t}\) for increasing \(t \, = \, 1,...,T\), with the following formula:
\[ \mathbf{\hat{I}}_{t} = \frac{1}{t} \sum_{i=1}^{t} h(\theta^{(i)}), \qquad \mathbf{\hat{I}}_{T} \approx \mathbf{I} \]
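A sketch of this running ergodic average on a synthetic chain standing in for the draws of one coefficient:

```r
set.seed(7)
theta <- rnorm(2700, mean = -0.054, sd = 0.031)   # stand-in for beta1 draws

# I_hat_t for t = 1, ..., T: mean of the first t draws
running_mean <- cumsum(theta) / seq_along(theta)

# The sequence stabilizes around the posterior mean as t grows
tail(running_mean, 1)
```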
Now, we consider the approximation error of the MCMC sampling: the Monte Carlo standard error (MCSE), i.e. the square root of the variance of \(\hat{I}_{t}\).
The variance formula is: \[ \mathbf{V}[\hat{I}_{t}] = \frac{Var_{\pi}[h(X_{1})]}{t_{eff}} = \Big( 1 + 2 \sum_{k=1}^{\infty} \rho_{k}\Big)\frac{\sigma^{2}}{T} \] The model we specified earlier has the following effective sample size for each parameter:

| | n.eff |
|---|---|
| beta1 | 730 |
| beta2 | 640 |
| beta3 | 440 |
| beta6 | 2700 |
| beta7 | 2700 |
| beta9 | 600 |
| deviance | 2700 |
Then we consider the MCSE, i.e. the square root of the variance formula written above. The results are:

| | MCSE |
|---|---|
| beta1 | 0.0006048 |
| beta2 | 0.0000185 |
| beta3 | 0.1716696 |
| beta6 | 0.4671390 |
| beta7 | 1.2175976 |
| beta9 | 0.0807752 |
| deviance | 0.1040791 |
For each parameter we also report the posterior mean, the standard deviation and the relative uncertainty (standard deviation over the absolute value of the mean):

| | Mean | Sd | Uncertainty |
|---|---|---|---|
| beta1 | -0.0540273 | 0.0313761 | 0.5807454 |
| beta2 | 0.0087138 | 0.0009798 | 0.1124473 |
| beta3 | -69.6685988 | 8.9123001 | 0.1279242 |
| beta6 | -62.5022100 | 23.2974092 | 0.3727454 |
| beta7 | 244.4336595 | 65.2122575 | 0.2667892 |
| beta9 | 27.0470111 | 4.2222632 | 0.1561083 |
| deviance | 327.7616204 | 5.3961565 | 0.0164637 |
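The Uncertainty column can be reproduced as the ratio sd / |mean|, i.e. the coefficient of variation; this interpretation is inferred from the reported numbers and is checked here for beta1 and beta2:

```r
# Values copied from the summary table above
mean_b1 <- -0.0540273; sd_b1 <- 0.0313761
mean_b2 <-  0.0087138; sd_b2 <- 0.0009798

cv <- function(m, s) s / abs(m)
round(c(beta1 = cv(mean_b1, sd_b1), beta2 = cv(mean_b2, sd_b2)), 5)
# beta1 ~ 0.58075 and beta2 ~ 0.11244, matching the Uncertainty column
```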
We now want to check whether these multiple simulations of the Markov chains have reached convergence and a valid stationary region.
Raftery and Lewis introduced an MCMC diagnostic that estimates the number of iterations needed for a given level of precision in posterior samples:
[[1]]
Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95
You need a sample size of at least 3746 with these values of q, r and s
[[2]]
Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95
You need a sample size of at least 3746 with these values of q, r and s
[[3]]
Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95
You need a sample size of at least 3746 with these values of q, r and s
We need at least 3746 samples per chain to achieve the required precision.
The Geweke diagnostic is based on a test for equality of the means of the first and last part of a Markov chain (by default the first 10% and the last 50%), using a mean-difference test to check whether the two parts of the chain come from the same distribution (null hypothesis). If the two means are equal, it is likely that the chain has converged.
The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error, adjusted for autocorrelation.

| | Z-score chain 1 | Z-score chain 2 | Z-score chain 3 |
|---|---|---|---|
| beta1 | -0.0795599 | -0.6997935 | -0.6453352 |
| beta2 | -2.1528704 | -0.8336848 | -0.3105050 |
| beta3 | 2.1035146 | 1.0074784 | 0.5752448 |
| beta6 | 1.1653385 | 1.2320715 | -0.3730159 |
| beta7 | -1.4590239 | -1.2396231 | 1.6495889 |
| beta9 | -1.2410542 | -1.1371497 | -0.5262548 |
| deviance | 1.1065939 | 1.2856623 | 1.2016557 |
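A simplified base-R version of this Z-score (coda::geweke.diag additionally adjusts the standard errors for autocorrelation via spectral-density estimates; this sketch uses plain standard errors and assumes weak autocorrelation):

```r
set.seed(3)
chain <- rnorm(2700, mean = 27, sd = 4.2)   # stand-in for beta9 draws

n <- length(chain)
a <- chain[1:floor(0.1 * n)]          # first 10% of the chain
b <- chain[(floor(0.5 * n) + 1):n]    # last 50% of the chain

# Mean-difference test statistic with naive standard errors
z <- (mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))
round(z, 2)
```

A value of |z| below roughly 1.96 gives no evidence against equal means, consistent with convergence.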
We know that trace plots provide only an informal diagnostic of chain convergence; we can also use a formal one, the Gelman & Rubin diagnostic. It calculates the variability within the chains and compares it to the variability between the chains.
The Potential Scale Reduction Factor is the statistic that measures chain convergence. The closer it is to 1, the better the convergence; values far from 1 indicate that the chains have not yet converged.

| | Point est. | Upper C.I. |
|---|---|---|
| beta1 | 1.0005423 | 1.002736 |
| beta2 | 1.0131851 | 1.049265 |
| beta3 | 1.0141489 | 1.050655 |
| beta6 | 1.0007899 | 1.003467 |
| beta7 | 1.0009240 | 1.004519 |
| beta9 | 1.0060780 | 1.024531 |
| deviance | 0.9999842 | 1.001437 |
All the values are near 1, which indicates that 10,000 iterations are enough for convergence to the stationary distribution.
The Heidelberger and Welch diagnostic calculates a test statistic to accept or reject the null hypothesis that the Markov chain comes from a stationary distribution.
The diagnostic consists of two parts: a stationarity test and a half-width test.
| | Stationarity test | Start iteration | P-value | Halfwidth test | Mean | Halfwidth |
|---|---|---|---|---|---|---|
| beta1 | passed | 1 | 0.3095 | passed | -0.0524 | 0.0022 |
| beta2 | passed | 181 | 0.0532 | passed | 0.0088 | 0.0002 |
| beta3 | passed | 91 | 0.1002 | passed | -70.5577 | 0.9234 |
| beta6 | passed | 1 | 0.3277 | passed | -62.8875 | 1.7414 |
| beta7 | passed | 1 | 0.1981 | passed | 243.5105 | 5.4210 |
| beta9 | passed | 1 | 0.0749 | passed | 27.3031 | 0.3413 |
| deviance | passed | 1 | 0.1402 | passed | 327.7822 | 0.4135 |
At this point, after checking that the simulations pass the convergence diagnostics, we compute credible intervals and point estimates for our estimated beta values.
Point Estimates:
| | Point Estimates |
|---|---|
| beta1 | -0.0540273 |
| beta2 | 0.0087138 |
| beta3 | -69.6685988 |
| beta6 | -62.5022100 |
| beta7 | 244.4336595 |
| beta9 | 27.0470111 |
| deviance | 327.7616204 |
Equi-tail intervals:
| | 2.5% | 97.5% |
|---|---|---|
| beta1 | -0.1166939 | 0.0053871 |
| beta2 | 0.0069013 | 0.0106727 |
| beta3 | -86.8518876 | -52.8325545 |
| beta6 | -107.6673089 | -16.4377104 |
| beta7 | 112.2856068 | 367.5216493 |
| beta9 | 19.1163814 | 35.5376515 |
HPD intervals:
| | lower | upper |
|---|---|---|
| beta1 | -0.1144907 | 0.0072405 |
| beta2 | 0.0067775 | 0.0105161 |
| beta3 | -85.7848650 | -52.0747865 |
| beta6 | -105.0492906 | -14.4295950 |
| beta7 | 121.7873472 | 374.0515537 |
| beta9 | 19.1100975 | 35.5459352 |
Let's compare the models using the DIC (Deviance Information Criterion), defined as follows:
\(DIC = D(\bar\theta) + 2 \, p_D\), where \(D(\theta) = -2 \, log\, f(y|\theta)\) is the deviance of the model, \(\bar\theta\) is the posterior mean of the parameters and \(p_D\) is the number of effective ("unconstrained") parameters in the model.
| | DIC |
|---|---|
| Model 1 (complete) | 346.7766 |
| Model 2 (reduced) | 342.3315 |
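The reported DIC values can be reconstructed from the deviance summaries above. R2jags uses Gelman's approximation pD = var(deviance) / 2, so DIC = mean(deviance) + pD; the values below are taken from the two summary tables:

```r
# Model 1 (complete): deviance mean 325.8906, sd 6.4609
dic1 <- 325.8906 + 6.4609^2 / 2
# Model 2 (reduced):  deviance mean 327.7616, sd 5.3962
dic2 <- 327.7616 + 5.3962^2 / 2

round(c(model1 = dic1, model2 = dic2), 2)   # close to the reported 346.78 and 342.33
```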
Here, we can observe the predictions of the second model under the Bayesian approach and evaluate its performance on unseen data using the confusion matrix, a tool for summarizing the performance of a classification method. From here, we can define different metrics that will be useful later on.
It's important to define the False Positive and False Negative that are the two types of errors. We can also think in terms of hypothesis testing and fix that:
Negative (-): Null hypothesis
Positive (+): Alternative hypothesis
In this case, if we fix a patient with the disease as positive, a false positive will be a patient who is actually negative but is incorrectly classified as sick. This type of corresponds to the first type of error (\(\alpha\)), that is rejecting the null hypothesis when it is true.
On the other hand, if a patient is actually sick but is incorrectly classified as healthy, we have a false negative, which corresponds to the Type II error (\(\beta\)), that is, not rejecting the null hypothesis when it is false.
Obviously, there is not always a balance between the two error components; often we need to find the right trade-off. In this analysis, I think it is more important to minimize false negatives than false positives: a false negative means not treating a person who is actually sick, probably causing their death, whereas a false positive means treating a patient who does not need it because they are healthy. This second problem could perhaps be mitigated with a more in-depth follow-up analysis of the flagged cases, so as to be really sure whether a person being kept in hospital actually needs treatment.
I think it is useful to follow this approach because, especially with fast-developing diseases such as tumors, classifying a patient as healthy when they are not means postponing intervention to, say, an annual check-up, by which time it may be too late to intervene with a cure.
At this point, to minimize false negatives we consider sensitivity, the fraction of true positives correctly classified: minimizing false negatives corresponds to maximizing true positives, which means that if a patient is actually sick (positive) we do not want to misclassify them. Obviously, we also want to keep the percentage of false positives under control, because keeping too many healthy people in the hospital for further checks would mean unnecessary costs and overcrowding of the structure.
For this reason, as the first goal we would want to maximize sensitivity, but we also try to get a good score for specificity. We can try to balance this trade-off with different threshold values.
It is really interesting to note how the value of the threshold can affect the classification error.
In the last two cases, we see how, as the threshold increases, the sensitivity decreases while the specificity increases; therefore we cannot consider these last two settings as optimal for the goal we want to achieve.
At this point, it is clear that the choice is between \(t = 0.2\) and \(t = 0.3\). We opt for \(t = 0.2\), which is the better compromise between the two goals: we will surely have fewer false negatives at the cost of a small increase in false positives. Granted, more false positives means more people waiting in vain for additional analyses to correctly classify their health status, but, as they say in these cases, prevention is better than cure.
The ROC curve is a graphical scheme for a binary classifier. It plots sensitivity against (1 - specificity), respectively the True Positive Rate (TPR, fraction of true positives) and the False Positive Rate (FPR, fraction of false positives, also known as Fall-Out), at various threshold values. It separates the signal from the noise, allowing us to study the relationship between instances that are actually positive (hit rate) and false alarms.
Through the analysis of the curve, the discriminatory capacity of the classifier between a set of positive and negative samples is assessed by calculating the area under the ROC curve, the Area Under Curve (AUC). The AUC takes values in the range [0, 1] and is equivalent to the probability that the classifier's score for an instance drawn at random from the positive group is higher than its score for an instance drawn at random from the negative group.
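This rank interpretation of the AUC can be computed directly, without drawing the curve, by counting positive-negative score pairs (the Mann-Whitney statistic). A sketch in Python on toy scores:

```python
import numpy as np

def auc_rank(y_true, p_hat):
    """AUC as P(score of random positive > score of random negative),
    counting ties as 1/2 (Mann-Whitney U statistic)."""
    pos = p_hat[y_true == 1]
    neg = p_hat[y_true == 0]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# toy scores, not this report's test set
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])
p = np.array([0.9, 0.6, 0.3, 0.4, 0.1, 0.2, 0.7, 0.8])
print(auc_rank(y, p))  # 13 of 16 pairs are correctly ordered
```

An AUC of 0.5 corresponds to random ranking, while 1.0 means every positive is scored above every negative.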
In the following plot, we show specificity (the fraction of correctly classified true negatives) on the x-axis and sensitivity on the y-axis. We can derive these quantities with the following formulas:
\(True\, Positive\, Rate\, (TPR) = \frac{TP}{P} \qquad\qquad True\, Negative\, Rate\, (TNR) = \frac{TN}{N} = 1 - FPR\)
Looking only at the ROC curve, it would seem obvious to choose \(t = 0.7\), which guarantees \(AUC = 0.80\). The problem, in this case, is that this value is driven more by specificity than by sensitivity, which is in fact below 0.75.
For this reason, we confirm the choice of \(t = 0.2\), with \(AUC = 0.77\), \(sensitivity = 0.93\), and \(specificity = 0.62\).
It is important to remember that the results obtained from this forecast are dependent on the specific split in training and testing done previously; this means that with another split procedure, that is with another training and test set, the performances will be different. For this reason, especially for future studies, a cross-validation method can be used to have a more robust and averaged evaluation of the method used.
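The cross-validation scheme suggested above can be sketched as follows; the split assumes the 569 observations of this dataset, and the model-fitting step is only indicated by a comment:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle row indices and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

# each fold serves once as the test set, the rest as training data
folds = kfold_indices(n=569, k=5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit the model on train_idx and score on test_idx here;
    # averaging the k scores gives a less split-dependent estimate
    print(i, len(train_idx), len(test_idx))
```

Averaging sensitivity or AUC over the k held-out folds would replace the single train/test split used in this report.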
Here, we can see a summary of all metrics for the different thresholds:

| | Sensitivity | Specificity | Precision | F1 | AUC |
|---|---|---|---|---|---|
| t = 0.1 | 0.9523810 | 0.4647887 | 0.5128205 | 0.6666667 | 0.7085848 |
| t = 0.2 | 0.9285714 | 0.6197183 | 0.5909091 | 0.7222222 | 0.7741449 |
| t = 0.3 | 0.8809524 | 0.6478873 | 0.5967742 | 0.7115385 | 0.7644199 |
| t = 0.4 | 0.8571429 | 0.7183099 | 0.6428571 | 0.7346939 | 0.7877264 |
| t = 0.5 | 0.7619048 | 0.7605634 | 0.6530612 | 0.7032967 | 0.7612341 |
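As a sanity check, the F1 column is the harmonic mean of the Precision and Sensitivity columns. For instance, for the \(t = 0.2\) row of the table above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)

# check one row of the table above (t = 0.2 for the logit model)
print(round(f1(0.5909091, 0.9285714), 7))  # matches the tabulated 0.7222222
```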
At this point, using the set of features of the reduced model, we want to implement another linear classifier, namely a probit model, to study its diagnostics and forecasting capacity.
We have implemented the model using RJags, as follows:
## Model Probit
model_probit <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    # probit link: phi() is the standard normal CDF in JAGS
    p[i] <- phi(beta1*x1[i] + beta2*x2[i] +
                beta3*x3[i] + beta6*x6[i] +
                beta7*x7[i] + beta9*x9[i])
  }
  ## Defining the prior beta parameters (vague normal)
  beta1 ~ dnorm(0, 1.0E-6)
  beta2 ~ dnorm(0, 1.0E-6)
  beta3 ~ dnorm(0, 1.0E-6)
  beta6 ~ dnorm(0, 1.0E-6)
  beta7 ~ dnorm(0, 1.0E-6)
  beta9 ~ dnorm(0, 1.0E-6)
}
## Passing the data for RJags
data.jags_probit <- list("y" = y, "N" = N, "x1" = x1,
"x2" = x2, "x3" = x3, "x6" = x6,
"x7" = x7, "x9" = x9)
## Defining parameters of interest to show after running RJags
mod.params_probit <- c("beta1", "beta2", "beta3",
"beta6", "beta7", "beta9")
## Bayesian Approach
## Run JAGS
## Model Probit
## Bayesian Approach
mod_jags_probit <- jags(data = data.jags_probit,
n.iter = 10000,
model.file = model_probit,
n.chains = 3,
parameters.to.save = mod.params_probit,
n.burnin = 1000, n.thin = 10)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 456
## Unobserved stochastic nodes: 6
## Total graph size: 6583
##
## Initializing model
| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta1 | -0.0292 | 0.0182 | -0.0652 | -0.0416 | -0.0292 | -0.0169 | 0.0054 | 1.0015 | 1900 |
| beta2 | 0.0049 | 0.0005 | 0.0040 | 0.0046 | 0.0049 | 0.0052 | 0.0059 | 1.0067 | 320 |
| beta3 | -39.2792 | 4.7924 | -48.9372 | -42.4228 | -39.1972 | -36.1112 | -30.2259 | 1.0013 | 2300 |
| beta6 | -37.2654 | 12.3441 | -61.7162 | -45.6099 | -37.3259 | -29.1238 | -12.8072 | 1.0009 | 2700 |
| beta7 | 140.8234 | 36.9233 | 67.0920 | 117.0872 | 141.5007 | 165.2947 | 212.4105 | 1.0061 | 850 |
| beta9 | 15.2782 | 2.2708 | 11.1024 | 13.7799 | 15.1856 | 16.7505 | 19.8330 | 1.0006 | 2700 |
| deviance | 327.3179 | 5.0265 | 322.4518 | 324.6194 | 326.5371 | 329.0299 | 336.1120 | 1.0007 | 2700 |
Even for the probit model, we determine the optimal value of \(t\) by looking at the confusion matrix and the ROC curve.
For the probit model, the correct choice would seem to be \(t = 0.3\), with \(sensitivity = 0.93\) and \(specificity = 0.55\).
Here, we can see a summary of all metrics for the different thresholds:

| | Sensitivity | Specificity | Precision | F1 | AUC |
|---|---|---|---|---|---|
| t = 0.1 | 1.0000000 | 0.1830986 | 0.4200000 | 0.5915493 | 0.5915493 |
| t = 0.2 | 0.9523810 | 0.4084507 | 0.4878049 | 0.6451613 | 0.6804158 |
| t = 0.3 | 0.9285714 | 0.5492958 | 0.5492958 | 0.6902655 | 0.7389336 |
| t = 0.4 | 0.8809524 | 0.6619718 | 0.6065574 | 0.7184466 | 0.7714621 |
| t = 0.5 | 0.7142857 | 0.7183099 | 0.6000000 | 0.6521739 | 0.7162978 |
As we can see, the model diagnostics tell us that the parameters reach convergence, as in the case of the logistic model. The DIC values are also quite similar to each other.
| Model | DIC |
|---|---|
| Model Logit (complete) | 346.7766 |
| Model Logit (reduced) | 342.3315 |
| Model Probit (reduced) | 339.9573 |
| Parameter | Logit | Probit |
|---|---|---|
| beta1 | -0.0540273 | -0.0466982 |
| beta2 | 0.0087138 | 0.0078528 |
| beta3 | -69.6685988 | -62.8467892 |
| beta6 | -62.5022100 | -59.6246831 |
| beta7 | 244.4336595 | 225.3173729 |
| beta9 | 27.0470111 | 24.4450615 |
| deviance | 327.7616204 | 523.7086160 |
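The two sets of coefficients differ mainly in scale: both links are S-shaped CDFs, but with different slopes, which is consistent with the probit estimates above being systematically smaller in magnitude than the logit ones. A small Python sketch comparing the two link functions:

```python
import math

def logistic_cdf(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def probit_cdf(x):
    """Inverse of the probit link: the standard normal CDF, via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# the normal CDF is steeper around 0 than the logistic CDF, so a probit
# model needs smaller coefficients to produce comparable probabilities
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, round(logistic_cdf(x), 3), round(probit_cdf(x), 3))
```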
Now we have established that, in terms of predictions, the reduced logit is better than the reduced probit. So, to check that the proposed model can correctly recover its parameters, we run a simulation on data generated from the model itself, taking as true values the beta estimates obtained from the reduced logit; with these values we proceed to our last goal.
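This parameter-recovery check can be illustrated with a self-contained sketch: simulate responses from a logistic model with known coefficients, refit, and verify that the estimates land near the truth. The coefficients and the optimizer below are illustrative (plain gradient ascent on the likelihood rather than MCMC):

```python
import numpy as np

def simulate_logit(beta, n, seed=1):
    """Draw covariates, then Bernoulli responses from the logistic model."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, len(beta)))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X, rng.binomial(1, p)

def fit_logit(X, y, iters=2000, lr=0.5):
    """Plain gradient ascent on the mean Bernoulli log-likelihood (no prior)."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b += lr * X.T @ (y - p) / len(y)
    return b

beta_true = np.array([-0.5, 1.0, 2.0])   # illustrative values, not the report's betas
X, y = simulate_logit(beta_true, n=5000)
print(np.round(fit_logit(X, y), 2))      # should land near beta_true
```

The JAGS run below performs the same check in the Bayesian setting, with the posterior summaries playing the role of the refitted coefficients.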
We have implemented the model using RJags, as follows:
## Model Logit (Simulated)
model_simulated <- function(){
  # Likelihood
  for (i in 1:N){
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta1_sim*x1_simulation[i] +
                   beta2_sim*x2_simulation[i] +
                   beta3_sim*x3_simulation[i] +
                   beta6_sim*x6_simulation[i] +
                   beta7_sim*x7_simulation[i] +
                   beta9_sim*x9_simulation[i]
  }
  ## Defining the prior beta parameters (vague normal)
  beta1_sim ~ dnorm(0, 1.0E-6)
  beta2_sim ~ dnorm(0, 1.0E-6)
  beta3_sim ~ dnorm(0, 1.0E-6)
  beta6_sim ~ dnorm(0, 1.0E-6)
  beta7_sim ~ dnorm(0, 1.0E-6)
  beta9_sim ~ dnorm(0, 1.0E-6)
}
## Passing the data for RJags
data.jags_sim <- list("y" = y_simulation,
"N" = N,
"x1_simulation" = x1_simulation,
"x2_simulation" = x2_simulation,
"x3_simulation" = x3_simulation,
"x6_simulation" = x6_simulation,
"x7_simulation" = x7_simulation,
"x9_simulation" = x9_simulation)
## Defining parameters of interest to show after running RJags
mod.params_sim <- c("beta1_sim", "beta2_sim", "beta3_sim",
"beta6_sim", "beta7_sim", "beta9_sim")
## Bayesian Approach
## Run JAGS
## Model Logit Simulated
mod_jags_sim <- jags(data = data.jags_sim,
n.iter = 20000,
model.file = model_simulated,
n.chains = 3,
parameters.to.save = mod.params_sim,
n.burnin = 1000, n.thin = 20)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 569
## Unobserved stochastic nodes: 6
## Total graph size: 6958
##
## Initializing model
Running JAGS with 3 chains, 20000 iterations, a burn-in of 1000 steps and a thinning of 20, we obtain:
As we can see, the parameters are very similar to those of the reduced logit, so the model is able to recover the model parameters correctly from the simulated data.
| | mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | Rhat | n.eff |
|---|---|---|---|---|---|---|---|---|---|
| beta1_sim | -0.0037 | 0.0323 | -0.0650 | -0.0261 | -0.0037 | 0.0181 | 0.0586 | 1.0024 | 1100 |
| beta2_sim | 0.0085 | 0.0009 | 0.0068 | 0.0079 | 0.0085 | 0.0091 | 0.0103 | 1.0102 | 240 |
| beta3_sim | -84.7606 | 8.6898 | -102.3195 | -90.2722 | -84.5709 | -79.0609 | -68.8216 | 1.0097 | 240 |
| beta6_sim | -99.7275 | 21.9301 | -143.5143 | -114.8321 | -99.2103 | -84.6228 | -58.5241 | 1.0052 | 430 |
| beta7_sim | 269.2955 | 62.6820 | 148.3840 | 227.2848 | 268.7020 | 310.3443 | 394.2221 | 1.0014 | 2800 |
| beta9_sim | 35.1551 | 4.0926 | 27.8540 | 32.4818 | 35.0112 | 37.7974 | 43.3874 | 1.0040 | 570 |
| deviance | 310.9338 | 8.1653 | 305.7895 | 308.0786 | 309.9588 | 312.6673 | 319.5757 | 1.0025 | 1600 |
After this analysis, the importance of understanding the problem at hand, rather than simply trying different methods to get the best performance, is clear. In fact, even with a simple logistic regression we obtain good metric values.
Furthermore, it is also important to understand that it will be difficult to obtain a model that maximizes all the metrics in question, so we need to define the goal to be achieved and look at the metrics that interest us.
Of course, we can try to improve performance in several ways: